ANN Building Blocks part 2

BS

Learning

Estimating parameters (\(w\) and \(b\))

Linear regression (\(\approx\) single linear neuron)

  • closed form solution
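
As a minimal sketch of what "closed form" means here, assuming a single input and the usual least-squares criterion, \(w\) and \(b\) can be computed directly from the data without any iteration. The function name `fit_line` and the toy data are illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares estimates of w and b for y ≈ w*x + b (single input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w = covariance(x, y) / variance(x); b then follows from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

# Toy data generated from y = 2x + 1 without noise, so the fit is exact
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ys)
```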









ANN (arbitrary number of neurons in layers)

  • closed form does not work
  • iterative optimization algorithm (=Learning)

Neuron

Supervised Learning

Aim

Find optimal values of \(w_{\cdot,j}\) and \(b_j\) over all neurons \(j\)

Tools

  • Loss function
    • (equiv. Cost/Error Function)
  • Gradient descent
    • Back-propagation
  • Cross-validation
    • Data
      • Training set
        • for learning
      • Validation set
        • know when to stop
      • Test set
        • quality control

Neuron

  • \(x\) = input
  • \(y\) known output corresponding to \(x\)
  • (Recall: \(\hat{y}\) is the estimated output)

Cross-validation (reminder)

Split data into

  1. training set
    • use in gradient descent during learning
  2. validation set
    • evaluate progress/convergence during learning
  3. test set
    • evaluate final result after learning
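
The three-way split above could be sketched as follows. The 70/15/15 proportions, the fixed seed and the name `split_data` are assumptions for illustration, not something the slides prescribe.

```python
import random

def split_data(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the samples and split them into training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]                # used in gradient descent
    val = shuffled[n_train:n_train + n_val]   # used to monitor convergence
    test = shuffled[n_train + n_val:]         # whatever remains; final evaluation
    return train, val, test

train, val, test = split_data(range(100))
```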

Loss Function

Suppose we have

  1. an ANN that, with input \(x\), produces an estimated output \(\hat{y}\)
  2. training samples \(X=(x^{(1)},\ldots,x^{(K)})\) with true output values \(Y=(y^{(1)},\ldots,y^{(K)})\).

Then the Quadratic Loss Function is defined as follows:

1. For each \(x\in X\), use the residual sum of squares (RSS) as an error measure

\(\begin{eqnarray*}L(w,b|x) &=& \sum_i\frac{1}{2} \left(y_i-\hat{y}_i\right)^2\end{eqnarray*}\)

2. The full quadratic cost function is simply the Mean Squared Error (MSE) used in cross-validation \[\begin{eqnarray} L(w,b) &=& \frac{1}{K} \sum_{k=1}^K L(w,b|x^{(k)}) \end{eqnarray}\]
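
A small sketch of the two formulas, assuming each true output \(y\) and estimate \(\hat{y}\) is given as a plain list of numbers. The function names and the toy values are illustrative.

```python
def quadratic_loss(y, y_hat):
    """Per-sample loss L(w,b|x): half the squared residuals, summed over i."""
    return sum(0.5 * (yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))

def mean_loss(Y, Y_hat):
    """Full loss L(w,b): average of the per-sample losses over the K samples."""
    K = len(Y)
    return sum(quadratic_loss(y, y_hat) for y, y_hat in zip(Y, Y_hat)) / K

# Two samples (K = 2), each with two output values
Y = [[1.0, 0.0], [0.0, 1.0]]
Y_hat = [[0.8, 0.1], [0.3, 0.6]]
loss = mean_loss(Y, Y_hat)
```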

Neuron

Gradient Descent

Optimization

Consider inverted hill-climbing in one dimension \(v\), i.e., we want to find the minimum instead of the maximum.

Hill-climbing

  1. randomly choose direction and length to change \(v\)
  2. stay if \(L(v|x)\) got lower, else go back.
We want to be smarter!
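
For comparison with what follows, the two hill-climbing steps could be sketched like this. The toy loss, the step-size limit and the number of steps are assumptions chosen for illustration.

```python
import random

def hill_climb_min(loss, v, n_steps=2000, max_step=0.5, seed=0):
    """Inverted hill-climbing: propose random moves, keep those that lower the loss."""
    rng = random.Random(seed)
    for _ in range(n_steps):
        # 1. randomly choose direction and length of the change
        candidate = v + rng.uniform(-max_step, max_step)
        # 2. stay at the new position if the loss got lower, else go back
        if loss(candidate) < loss(v):
            v = candidate
    return v

loss = lambda v: (v - 3.0) ** 2   # toy loss with its minimum at v = 3
v_min = hill_climb_min(loss, v=0.0)
```

Most proposals are wasted once \(v\) is close to the minimum, which is the inefficiency that using the derivative avoids.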

Gradient descent
  1. compute the derivative \(\frac{dL(v|x)}{dv}\) to see which way down is
  2. Take a reasonably long step in that direction, \(v' = v-\eta\frac{dL(v|x)}{dv}\)

\(\eta\) is called the learning rate
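
A minimal sketch of the update rule \(v' = v-\eta\frac{dL(v|x)}{dv}\) in one dimension. The toy loss \((v-3)^2\), the value \(\eta = 0.1\) and the fixed number of steps are assumptions for illustration.

```python
def gradient_descent_1d(dloss, v, eta=0.1, n_steps=100):
    """Repeat v' = v - eta * dL/dv for a fixed number of steps."""
    for _ in range(n_steps):
        v = v - eta * dloss(v)
    return v

# Toy loss L(v) = (v - 3)^2 with derivative dL/dv = 2 (v - 3)
dloss = lambda v: 2.0 * (v - 3.0)
v_min = gradient_descent_1d(dloss, v=0.0)
```

Every step moves downhill, and the steps shrink automatically as the derivative approaches zero near the minimum.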

Neuron

Gradient Descent in higher dimensions

Same idea, but we need a partial derivative for each dimension, which makes it look more complicated.

valley

Consider a 2-dimensional case. We will treat each dimension separately.

  1. Find the partial derivatives for both dimensions \[\begin{pmatrix} \frac{\partial L(v_1,v_2|x)}{\partial v_1}\\ \frac{\partial L(v_1,v_2|x)}{\partial v_2} \end{pmatrix}\]

  2. Take a reasonably long step \(\begin{eqnarray*} \begin{pmatrix} v'_1\\ v'_2\end{pmatrix} &=& \begin{pmatrix}v_1-\eta\frac{\partial L(v_1,v_2|x)}{\partial v_1} \\ v_2-\eta\frac{\partial L(v_1,v_2|x)}{\partial v_2} \end{pmatrix} \end{eqnarray*}\)

(A vector of partial derivatives is called a gradient)
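
The two-dimensional update could be sketched as below, assuming a toy loss \(L(v_1,v_2) = (v_1-1)^2 + 2(v_2+2)^2\) whose minimum is at \((1,-2)\). The loss, \(\eta\) and the step count are illustrative choices.

```python
def gradient(v1, v2):
    """Gradient of the toy loss L(v1, v2) = (v1 - 1)^2 + 2 (v2 + 2)^2."""
    return 2.0 * (v1 - 1.0), 4.0 * (v2 + 2.0)

def gradient_descent_2d(v1, v2, eta=0.1, n_steps=200):
    for _ in range(n_steps):
        g1, g2 = gradient(v1, v2)
        # each dimension is updated with its own partial derivative
        v1, v2 = v1 - eta * g1, v2 - eta * g2
    return v1, v2

v1, v2 = gradient_descent_2d(0.0, 0.0)
```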

More realistic parameter space

Gradient descent strategy

Algorithm
  1. Initialize weights and biases randomly \(\sim N(0, \sigma^2)\)
  2. Loop for \(M\) epochs or until convergence:
    • For each weight \(w_{i,j}\) and each bias \(b_j\) :
      1. Compute partial derivatives: \[\begin{eqnarray*} \frac{\partial L(w,b|x)}{\partial w_{i,j}}\\ \frac{\partial L(w,b|x)}{\partial b_{j}} \end{eqnarray*}\]
      2. Update: \[\begin{eqnarray*} w_{i,j} &=& w_{i,j} - \eta \frac{\partial L(w,b|x)}{\partial w_{i,j}}\\ b_{j} &=& b_{j} - \eta \frac{\partial L(w,b|x)}{\partial b_{j}} \end{eqnarray*}\]
  3. Return final weights and biases

For this to work, we need to be able to compute all \(\frac{\partial L(w,b|x)}{\partial v}\) efficiently
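
To make the cost concrete, here is a sketch of the algorithm for a chain of two sigmoid neurons, with the partial derivatives estimated by finite differences. This is a deliberately naive stand-in, not the method used in practice: it needs two extra loss evaluations per parameter per epoch. The network, \(\eta\), \(\sigma\), the epoch count and the single training sample are all assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(params, x, y):
    """Quadratic loss of a chain of two sigmoid neurons for one sample."""
    w1, b1, w2, b2 = params
    y_hat = sigmoid(w2 * sigmoid(w1 * x + b1) + b2)
    return 0.5 * (y - y_hat) ** 2

def numeric_gradient(params, x, y, h=1e-6):
    """Central finite differences: two loss evaluations per parameter."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        up[i] += h
        down = list(params)
        down[i] -= h
        grad.append((loss(up, x, y) - loss(down, x, y)) / (2 * h))
    return grad

def train(x, y, eta=0.5, epochs=2000, sigma=0.5, seed=0):
    rng = random.Random(seed)
    # 1. initialize weights and biases randomly ~ N(0, sigma^2)
    params = [rng.gauss(0.0, sigma) for _ in range(4)]
    # 2. loop for a fixed number of epochs
    for _ in range(epochs):
        grad = numeric_gradient(params, x, y)                  # partial derivatives
        params = [p - eta * g for p, g in zip(params, grad)]   # update
    # 3. return final weights and biases
    return params

x, y = 0.05, 0.01
params = train(x, y)
final_loss = loss(params, x, y)
```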


Solution: Back propagation

Back propagation – Forward pass

Neuron

\(i_1 \Rightarrow z_1 \Rightarrow a_1 \Rightarrow z_2 \Rightarrow a_2 \Rightarrow \widehat{y}\)

\(i_1 \quad = \quad x \quad = \quad 0.05\)

\(z_1 \quad = \quad i_1 \times w_1 + b_1 \quad = \quad 0.05 \times 0.1 - 0.1 \quad = \quad -0.095\)

\(a_1 \quad = \quad \sigma(z_1) \quad = \quad \sigma(-0.095) \quad = \quad 0.476\)

\(z_2 \quad = \quad a_1 \times w_2 + b_2 \quad = \quad 0.476 \times 0.3 + 0.3 \quad = \quad 0.443\)

\(\hat{y} = a_2 \quad = \quad \sigma(z_2) \quad = \quad \sigma(0.443) \quad = \quad 0.609\)
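
The forward pass can be reproduced in a few lines. The parameter values \(w_1 = 0.1\), \(b_1 = -0.1\), \(w_2 = 0.3\), \(b_2 = 0.3\) are read off the arithmetic of the worked example, and \(\sigma\) is assumed to be the logistic sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Parameter values read off the worked example
w1, b1 = 0.1, -0.1
w2, b2 = 0.3, 0.3
x = 0.05

i1 = x
z1 = i1 * w1 + b1
a1 = sigmoid(z1)
z2 = a1 * w2 + b2
a2 = sigmoid(z2)   # a2 is the estimated output y_hat
```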

Back propagation – Backward pass

Neuron


\(x \quad = \quad 0.05\)

\(i_1 \quad = \quad 0.05\)

\(z_1 \quad = \quad -0.095\)

\(a_1 \quad = \quad 0.476\)

\(z_2 \quad = \quad 0.443\)

\(a_2 \quad = \quad 0.609\)

\(y \quad = \quad 0.01\)

Partial derivative w.r.t.:

\(w_2: \quad \frac{\partial z_2}{\partial w_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial L(w,b|x)}{\partial a_2} = \frac{\partial L(w,b|x)}{\partial w_2}\)

\(\frac{\partial z_2}{\partial w_2} \quad = \quad \frac{\partial \left(a_1\times w_2 +b_2\right)}{\partial w_2} \quad = \quad a_1 \quad = \quad 0.476\)

\(\frac{\partial a_2}{\partial z_2} \quad = \quad \frac{\partial \sigma(z_2)}{\partial z_2} \quad = \quad a_2\left(1-a_2\right) \quad = \quad 0.609(1-0.609) \quad = \quad 0.238\)

\(\frac{\partial L(w,b|x)}{\partial a_2} \quad = \quad \frac{\partial \frac{1}{2}(y - a_2)^2}{\partial a_2} \quad = \quad \left(a_2-y\right) \quad = \quad 0.609 - 0.01 \quad = \quad 0.599\)

\(\frac{\partial L(w,b|x)}{\partial w_2} \quad = \quad \frac{\partial z_2}{\partial w_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial L(w,b|x)}{\partial a_2} \quad = \quad 0.476 \times 0.238 \times 0.599 \quad = \quad 0.068\)
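
The three chain-rule factors for \(w_2\) can be evaluated directly from the stored forward-pass results. As before, the parameter values are those implied by the worked example and \(\sigma\) is assumed to be the logistic sigmoid, whose derivative is \(\sigma(z)(1-\sigma(z))\).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w1, b1, w2, b2 = 0.1, -0.1, 0.3, 0.3   # values implied by the forward pass
x, y = 0.05, 0.01

# Forward pass (intermediate results are stored for re-use)
z1 = x * w1 + b1
a1 = sigmoid(z1)
z2 = a1 * w2 + b2
a2 = sigmoid(z2)

# The three factors of the chain rule for w2
dz2_dw2 = a1
da2_dz2 = a2 * (1.0 - a2)   # sigmoid derivative, expressed via its output
dL_da2 = a2 - y
dL_dw2 = dz2_dw2 * da2_dz2 * dL_da2
```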

Partial derivative w.r.t.:

\(w_2: \quad \frac{\partial z_2}{\partial w_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial L(w,b|x)}{\partial a_2} = \frac{\partial L(w,b|x)}{\partial w_2}\)

\(b_2: \quad \frac{\partial z_2}{\partial b_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial L(w,b|x)}{\partial a_2} = \frac{\partial L(w,b|x)}{\partial b_2}\)

\(w_1: \quad \frac{\partial z_1}{\partial w_1} \times \frac{\partial a_1}{\partial z_1} \times \frac{\partial z_2}{\partial a_1} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial L(w,b|x)}{\partial a_2} = \frac{\partial L(w,b|x)}{\partial w_1}\)

\(b_1: \quad \frac{\partial z_1}{\partial b_1} \times \frac{\partial a_1}{\partial z_1} \times \frac{\partial z_2}{\partial a_1} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial L(w,b|x)}{\partial a_2} = \frac{\partial L(w,b|x)}{\partial b_1}\)
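
All four products share their rightmost factors, so these can be computed once and re-used. The sketch below does this for the example network and compares one gradient against a finite-difference estimate. The helper names (`forward`, `backward`, `delta1`, `delta2`) are illustrative, and the parameter values are those implied by the worked example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, b1, w2, b2, x):
    z1 = x * w1 + b1
    a1 = sigmoid(z1)
    z2 = a1 * w2 + b2
    a2 = sigmoid(z2)
    return a1, a2

def backward(w1, b1, w2, b2, x, y):
    """Gradients of L = 0.5 * (y - a2)^2 via the chain rule."""
    a1, a2 = forward(w1, b1, w2, b2, x)
    # Shared factor, computed once and re-used by all four gradients
    delta2 = a2 * (1.0 - a2) * (a2 - y)      # da2/dz2 * dL/da2
    dL_dw2 = a1 * delta2                     # dz2/dw2 = a1
    dL_db2 = 1.0 * delta2                    # dz2/db2 = 1
    delta1 = a1 * (1.0 - a1) * w2 * delta2   # da1/dz1 * dz2/da1 * delta2
    dL_dw1 = x * delta1                      # dz1/dw1 = i1 = x
    dL_db1 = 1.0 * delta1                    # dz1/db1 = 1
    return dL_dw1, dL_db1, dL_dw2, dL_db2

def loss(w1, b1, w2, b2, x, y):
    return 0.5 * (y - forward(w1, b1, w2, b2, x)[1]) ** 2

params = (0.1, -0.1, 0.3, 0.3)
x, y = 0.05, 0.01
grads = backward(*params, x, y)

# Finite-difference estimate of dL/dw1 as an independent check
h = 1e-6
numeric_dw1 = (loss(0.1 + h, -0.1, 0.3, 0.3, x, y)
               - loss(0.1 - h, -0.1, 0.3, 0.3, x, y)) / (2 * h)
```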

Back propagation IRL

Multiple neurons per layer

  1. Interactions between layers
  2. Requires vector and matrix multiplication

Even more complex designs

  • Requires operations on multidimensional matrices

Tensors

  • Arrays (matrices) of arbitrary dimensions (ML def)
  • Tensor operations
    • multiplication, decomposition, …
    • produce new tensors

TensorFlow

  • The forward and backward passes are viewed as
    “Tensors (e.g., layers) that flow through the network”
  • An additional twist is that tensors allow running all, or chunks of, the test samples simultaneously
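
A small sketch of the idea, using NumPy arrays rather than TensorFlow to keep the example short (an assumption of this sketch, not the course's setup). The layer sizes and random values are arbitrary. One matrix product per layer processes every sample and every neuron at once.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # 5 samples, 3 input features each
W1 = rng.normal(size=(3, 4))   # 3 inputs -> 4 hidden neurons
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 2))   # 4 hidden neurons -> 2 outputs
b2 = np.zeros(2)

# One matrix product per layer handles all samples simultaneously
A1 = sigmoid(X @ W1 + b1)      # shape (5, 4)
Y_hat = sigmoid(A1 @ W2 + b2)  # shape (5, 2)

# Same result as pushing each sample through the network one at a time
Y_loop = np.stack([sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2) for x in X])
```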

Neuron

Summary Learning

(Quadratic) Loss function

\[\begin{eqnarray} L(w,b|x) &=& \frac{1}{2}\sum_i\left(y_i-\hat{y}_i\right)^2\\ L(w,b) &=& \frac{1}{K}\sum_{k=1}^K L(w,b|x^{(k)}) \end{eqnarray}\]

  • Residual sum of squares (RSS)
  • Mean squared error (MSE)

Gradient descent

  • “Clever hill-climbing” in several dimensions
  • Change all variables \(v\in (w,b)\) by taking a reasonable step (scaled by the learning rate \(\eta\)) in the direction opposite to the gradient \[\begin{equation} v' = v-\eta \frac{\partial L(w,b|x)}{\partial v} \end{equation}\]

Back propagation

  • Decomposition of gradients (allows storing and re-using results)
  • Efficient implementation using tensors

Activation functions revisited

Perceptron – step activation
  • Pros
    • Clear classification (0/1)
  • Why did the perceptron “fail”?
    • 1 layer \(\Rightarrow\) linear classification
    • Not meaningfully differentiable
      • a requirement for multilayer ANN

valley

Activation functions revisited

Why not use the linear function
  • Pros:
    • continuous output
      • better output “resolution”
  • Cons:
    • Not really “meaningfully” differentiable (the derivative is a constant)
    • Multilayer linear ANN collapses into a single linear model

However, it is used in the output layer for regression problems!
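
The collapse of stacked linear layers can be checked with two single-input linear neurons: \(w_2(w_1 x + b_1) + b_2 = (w_2 w_1)x + (w_2 b_1 + b_2)\), which is again one linear neuron. The numeric values below are arbitrary illustrations.

```python
def linear_layer(w, b):
    """A single neuron with linear (identity) activation."""
    return lambda x: w * x + b

layer1 = linear_layer(2.0, 1.0)    # w1 = 2, b1 = 1
layer2 = linear_layer(-3.0, 0.5)   # w2 = -3, b2 = 0.5

# Two stacked linear neurons: w2 * (w1 * x + b1) + b2
stacked = lambda x: layer2(layer1(x))

# One linear neuron with w = w2 * w1 and b = w2 * b1 + b2
collapsed = linear_layer(-3.0 * 2.0, -3.0 * 1.0 + 0.5)

xs = [-1.0, 0.0, 0.5, 2.0]
outputs = [(stacked(x), collapsed(x)) for x in xs]
```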

valley

Activation functions revisited

Sigmoid activation function

  • Meaningfully differentiable
Intermediate between step and linear
  • True for most activation functions
  • Balance between pros and cons

valley

Activation functions revisited

ReLU activation function

  • Meaningfully differentiable
(A different) intermediate between step and linear
  • True for most activation functions
  • Balance between pros and cons

valley

Activation functions summary

  • Meaningfully differentiable is important
  • Often needs to balance pros and cons
  • Two main families


Sigmoid (logistic) family

Examples
  • Sigmoid
  • Tanh

ReLU family

Examples
  • ReLU
  • Leaky ReLU
  • PReLU
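
For reference, the five examples could be written as below, using their commonly cited scalar definitions. The default slope `alpha=0.01` for the leaky variant is a typical choice rather than a fixed standard, and the parametric variant is shown with `alpha` passed in explicitly, since in practice it is learned during training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but with a small fixed slope alpha for negative inputs."""
    return z if z > 0 else alpha * z

def prelu(z, alpha):
    """Same form as leaky ReLU, but alpha is a learned parameter."""
    return z if z > 0 else alpha * z
```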






(More about pros and cons of different activation functions in a later lecture)